EDA¶

Checking Information of Dataset¶

In [104]:
import pandas as pd
df = pd.read_csv('source.csv')

print(df.head())
print(df.info())  # df.info() prints its summary itself and returns None, hence the "None" line in the output
print(df.describe())
                       time  latitude  longitude  depth  mag magType  nst  \
0  2018-11-27T14:34:20.900Z   48.3780   154.9620  35.00  4.9      mb  NaN   
1  2018-11-26T23:33:50.630Z   36.0733   139.7830  48.82  4.8     mww  NaN   
2  2018-11-26T13:04:02.250Z   38.8576   141.8384  50.56  4.5      mb  NaN   
3  2018-11-26T05:20:16.440Z   50.0727   156.1420  66.34  4.6      mb  NaN   
4  2018-11-25T09:19:05.010Z   33.9500   134.4942  38.19  4.6      mb  NaN   

     gap   dmin   rms  ...                   updated  \
0   92.0  5.044  0.63  ...  2018-11-27T16:06:33.040Z   
1  113.0  1.359  1.13  ...  2018-11-27T16:44:22.223Z   
2  145.0  1.286  0.84  ...  2018-11-26T23:52:21.074Z   
3  128.0  3.191  0.62  ...  2018-11-26T08:13:58.040Z   
4  104.0  0.558  0.61  ...  2018-11-25T23:24:52.615Z   

                                  place        type horizontalError  \
0  269km SSW of Severo-Kuril'sk, Russia  earthquake             7.6   
1               3km SSW of Sakai, Japan  earthquake             6.0   
2            26km SSE of Ofunato, Japan  earthquake             8.4   
3     67km S of Severo-Kuril'sk, Russia  earthquake             9.7   
4         9km SW of Komatsushima, Japan  earthquake             3.4   

  depthError  magError  magNst    status  locationSource magSource  
0        1.7     0.036   248.0  reviewed              us        us  
1        6.1     0.071    19.0  reviewed              us        us  
2        9.5     0.156    12.0  reviewed              us        us  
3        7.8     0.045   151.0  reviewed              us        us  
4       10.1     0.132    17.0  reviewed              us        us  

[5 rows x 22 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14092 entries, 0 to 14091
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   time             14092 non-null  object 
 1   latitude         14092 non-null  float64
 2   longitude        14092 non-null  float64
 3   depth            14092 non-null  float64
 4   mag              14092 non-null  float64
 5   magType          14092 non-null  object 
 6   nst              10483 non-null  float64
 7   gap              13310 non-null  float64
 8   dmin             3607 non-null   float64
 9   rms              14014 non-null  float64
 10  net              14092 non-null  object 
 11  id               14092 non-null  object 
 12  updated          14092 non-null  object 
 13  place            14092 non-null  object 
 14  type             14092 non-null  object 
 15  horizontalError  2800 non-null   float64
 16  depthError       9040 non-null   float64
 17  magError         3431 non-null   float64
 18  magNst           11048 non-null  float64
 19  status           14092 non-null  object 
 20  locationSource   14092 non-null  object 
 21  magSource        14092 non-null  object 
dtypes: float64(12), object(10)
memory usage: 2.4+ MB
None
           latitude     longitude         depth           mag           nst  \
count  14092.000000  14092.000000  14092.000000  14092.000000  10483.000000   
mean      37.410294    142.980441     51.364838      4.817045    117.352094   
std        6.605873      6.552510     76.603810      0.378618    123.273889   
min       23.532900    124.293000      0.000000      4.500000      5.000000   
25%       33.147675    141.071000     14.400000      4.600000     36.000000   
50%       37.357000    142.452100     35.000000      4.700000     69.000000   
75%       42.271325    144.432000     50.372500      4.900000    153.000000   
max       50.816100    158.818000    683.360000      9.100000    929.000000   

                gap         dmin           rms  horizontalError   depthError  \
count  13310.000000  3607.000000  14014.000000      2800.000000  9040.000000   
mean     104.272149     2.359796      0.876561         7.288607     7.822920   
std       37.893474     1.658681      0.203787         2.263028     5.861948   
min        8.000000     0.038000      0.120000         1.400000     0.000000   
25%       78.000000     1.109000      0.740000         5.800000     4.400000   
50%      112.700000     1.979000      0.850000         7.100000     6.200000   
75%      130.900000     3.122500      0.990000         8.500000     9.600000   
max      306.600000    18.781000      1.880000        25.600000    70.700000   

          magError        magNst  
count  3431.000000  11048.000000  
mean      0.095182     48.590695  
std       0.060710     70.233727  
min       0.019000      1.000000  
25%       0.054000      9.000000  
50%       0.079000     23.000000  
75%       0.118000     57.000000  
max       0.555000    941.000000  

Convert to Datetime¶

In [73]:
df['time'] = pd.to_datetime(df['time'], errors='coerce')
df['updated'] = pd.to_datetime(df['updated'], errors='coerce')
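With `errors='coerce'`, a string that cannot be parsed becomes `NaT` instead of raising an error, so malformed timestamps go unnoticed unless they are counted. The sketch below mimics that behaviour with the standard library only. `parse_or_none` is a hypothetical helper, and the format string assumes every timestamp has the layout shown in the `head()` output.

```python
from datetime import datetime, timezone

# Assumed layout, based on the timestamps shown in the head() output
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

def parse_or_none(text):
    """Return a UTC datetime, or None when the text does not match ISO_FORMAT."""
    try:
        return datetime.strptime(text, ISO_FORMAT).replace(tzinfo=timezone.utc)
    except ValueError:
        return None

# One real-looking timestamp and one deliberately broken value
samples = ["2018-11-27T14:34:20.900Z", "not a timestamp"]
parsed = [parse_or_none(s) for s in samples]

# Count the values that could not be parsed
n_failed = sum(p is None for p in parsed)
print(parsed[0].year, n_failed)
```

On the real DataFrame, `df['time'].isna().sum()` after the conversion gives the equivalent count of failed parses.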

Handling Missing Earthquake Data in DataFrame¶

In [100]:
# Count missing values per column
print("Total missing values:\n", df.isnull().sum())

original_len = len(df)

# drop rows if latitude, longitude, mag, or depth are missing
df.dropna(subset=['latitude', 'longitude', 'mag', 'depth'], inplace=True)

# Check how many rows remain after dropping
print(f"DataFrame length before dropna: {original_len}")
print(f"DataFrame length after dropna: {len(df)}")
Total missing values:
 time                   0
latitude               0
longitude              0
depth                  0
mag                    0
magType                0
nst                 3609
gap                  782
dmin               10485
rms                   78
net                    0
id                     0
updated                0
place                  0
type                   0
horizontalError    11292
depthError          5052
magError           10661
magNst              3044
status                 0
locationSource         0
magSource              0
dtype: int64
DataFrame length before dropna: 14092
DataFrame length after dropna: 14092

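No values are missing in latitude, longitude, mag, or depth, so no rows were dropped. Several other columns are largely incomplete, however: horizontalError, magError, and dmin are each missing in more than 10,000 of the 14,092 rows.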
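Raw counts are easier to judge as percentages of the 14,092 rows. The sketch below redoes that arithmetic from the counts printed above; the dictionary is copied by hand from that output, and `missing_pct` is a name introduced here for illustration.

```python
TOTAL_ROWS = 14092  # from the df.info() output

# Missing counts copied from the df.isnull().sum() output
missing_counts = {
    "nst": 3609, "gap": 782, "dmin": 10485, "rms": 78,
    "horizontalError": 11292, "depthError": 5052,
    "magError": 10661, "magNst": 3044,
}

# Percentage of rows with a missing value, per column
missing_pct = {
    col: round(100 * n / TOTAL_ROWS, 1) for col, n in missing_counts.items()
}

# Print the columns from most to least incomplete
for col, pct in sorted(missing_pct.items(), key=lambda item: item[1], reverse=True):
    print(f"{col:<16}{pct:>5.1f}%")
```

On the live DataFrame, `df.isnull().mean() * 100` computes the same percentages directly.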

Visualizing Earthquake Magnitudes, Depths, Temporal Trends, and Epicenters¶

In [71]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(data=df, x='mag', binwidth=0.2)
plt.title('Distribution of Earthquake Magnitudes')
plt.show()
[Figure: histogram, "Distribution of Earthquake Magnitudes"]

Most earthquakes have magnitudes between 4.5 and 5.0: the dataset contains no events below magnitude 4.5, and 75% are at or below 4.9. As magnitude increases, the number of events drops sharply, giving a right-skewed distribution.

In [5]:
sns.histplot(data=df, x='depth', binwidth=10)
plt.title('Distribution of Earthquake Depths')
plt.show()
[Figure: histogram, "Distribution of Earthquake Depths"]

Most earthquakes occur at shallow depths, mainly between 0 and 100 km, with a median depth of 35 km. Deeper earthquakes are much less frequent, giving a right-skewed distribution.
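The right skew described for both histograms can be checked numerically by comparing the mean with the median: in a right-skewed distribution the mean is pulled above the median. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / std, with values copied from the `describe()` output. `pearson_median_skew` is a hypothetical helper name.

```python
def pearson_median_skew(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std

# mean, 50% (median) and std copied from the df.describe() output
mag_skew = pearson_median_skew(mean=4.817045, median=4.7, std=0.378618)
depth_skew = pearson_median_skew(mean=51.364838, median=35.0, std=76.603810)

print(f"mag:   {mag_skew:.2f}")
print(f"depth: {depth_skew:.2f}")
```

Both coefficients are positive, which is consistent with the long right tails seen in the plots. On the DataFrame itself, `df['mag'].skew()` and `df['depth'].skew()` give the moment-based skewness instead.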

In [6]:
df['year'] = df['time'].dt.year
yearly_counts = df.groupby('year')['id'].count()
plt.plot(yearly_counts.index, yearly_counts.values)
plt.title('Number of Earthquakes by Year')
plt.show()
[Figure: line plot, "Number of Earthquakes by Year"]

Scatterplot of Depth vs. Magnitude¶

In [8]:
sns.scatterplot(data=df, x='mag', y='depth', alpha=0.5)
plt.title('Depth vs. Magnitude')
plt.show()
[Figure: scatter plot, "Depth vs. Magnitude"]

Most earthquakes occur at shallow depths. There is no clear correlation between depth and magnitude.

In [9]:
corr = df[['nst','gap','dmin','rms','horizontalError','depthError','magError']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
[Figure: "Correlation Heatmap"]
  • nst (number of stations) and gap are strongly negatively correlated (-0.66):
    • → When more stations record an event, the azimuthal gap tends to be smaller, indicating better coverage and likely more accurate locations.
  • nst also shows a moderate negative correlation with depthError (-0.37):
    • → Events recorded by more stations tend to have smaller depth errors, which suggests that denser station coverage could reduce depth estimation error.
  • gap is moderately positively correlated with magError (0.44):
    • → A larger azimuthal gap may lead to less accurate magnitude estimates, likely because directional coverage is incomplete.
  • dmin (distance to the nearest station) is positively correlated with horizontalError (0.52):
    • → The farther the nearest station is from the epicenter, the larger the horizontal location error tends to be. This underlines the importance of nearby sensors.
  • magError is also positively correlated with several other variables, ranging from moderate to very weak: nst (0.44), depthError (0.22), and horizontalError (0.083):
    • → None of these correlations is strong, but together they suggest that several observational factors influence magnitude accuracy.
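One caveat when reading the heatmap: `DataFrame.corr()` computes Pearson coefficients and excludes missing values pair by pair, so each cell can rest on a different number of events. Since horizontalError is present in only 2,800 rows, some cells use far fewer observations than others. The sketch below shows the pairwise idea in plain Python. `pairwise_pearson` is a hypothetical helper and the toy values are invented, not taken from the dataset.

```python
import math

def pairwise_pearson(xs, ys):
    """Pearson r over positions where both values are present, plus the pair count."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan"), n
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan"), n
    return cov / math.sqrt(var_x * var_y), n

# Invented toy values; None marks a missing measurement
nst = [10, 20, None, 40, 50]
gap = [200.0, 150.0, 120.0, None, 60.0]

r, n_pairs = pairwise_pearson(nst, gap)
print(f"r = {r:.2f} from {n_pairs} complete pairs")
```

Reporting the pair count next to each coefficient makes it clear which correlations are supported by few events.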

Machine Learning¶

Clustering of Earthquakes (Spatial Pattern Analysis)¶

In [97]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.colors as mcolors

df = pd.read_csv("source.csv")

# Clustering Columns
data_for_cluster = df[['latitude', 'longitude', 'depth']].dropna()

# Standardize
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_for_cluster)

# KMeans
k = 4  # number of clusters, set to the number of plates
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(data_scaled)


centroids = pd.DataFrame(kmeans.cluster_centers_, columns=['latitude', 'longitude', 'depth'])
centroids['cluster_id'] = range(k)
print("Cluster Centers (standardized):\n", centroids)

# Add labels
clustered_df = data_for_cluster.copy()
clustered_df['cluster_label'] = labels

print("\nCluster Means:")
print(clustered_df.groupby('cluster_label').mean())


cluster_color_map = {
    0: '#1f78b4',
    1: '#33a02c',
    2: '#e31a1c',
    3: '#ff7f00' 
}

# Assign a color to each point based on its cluster label
color_list = [cluster_color_map[label] for label in labels]

# 2D visualization (latitude and longitude)
plt.figure(figsize=(8, 6))
plt.scatter(
    data_for_cluster['longitude'],
    data_for_cluster['latitude'],
    c=color_list,
    alpha=0.6,
    edgecolors='white',
    linewidths=0.5
)
plt.title(f"KMeans Clustering (k={k})")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()

# 3D visualization (latitude, longitude, and depth)
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(
    data_for_cluster['longitude'],
    data_for_cluster['latitude'],
    data_for_cluster['depth'],
    c=color_list,
    edgecolors='white',
    linewidths=0.5
)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_zlabel("Depth")
plt.title(f"KMeans Clustering (k={k})")
plt.show()
Cluster Centers (standardized):
    latitude  longitude     depth  cluster_id
0 -0.249139  -0.120414 -0.208154           0
1 -1.332342  -2.117011 -0.127855           1
2  1.363400   1.385884 -0.079107           2
3 -0.238750  -0.557242  4.626822           3

Cluster Means:
                latitude   longitude       depth
cluster_label                                   
0              35.764571  142.191454   35.419984
1              28.609327  129.109195   41.570978
2              46.416419  152.061141   45.305156
3              35.833199  139.329235  405.784451
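[Figure: 2D scatter of longitude vs. latitude, colored by cluster, "KMeans Clustering (k=4)"]
[Figure: 3D scatter of longitude, latitude, and depth, colored by cluster, "KMeans Clustering (k=4)"]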
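The standardized cluster centers and the cluster means describe the same points in different units. `StandardScaler` computes z = (x − mean) / std with the population standard deviation, so a center maps back to original units as x = mean + z × std. The sketch below checks this for the depth of cluster 3 using numbers printed earlier in the notebook. Because `describe()` reports the sample standard deviation (ddof=1), it is converted first. `unstandardize` is a hypothetical helper.

```python
import math

N_ROWS = 14092                 # rows in the dataset
DEPTH_MEAN = 51.364838         # from df.describe()
DEPTH_STD_SAMPLE = 76.603810   # from df.describe(), which uses ddof=1

# Convert the sample std to the population std (ddof=0) used by StandardScaler
depth_std_pop = DEPTH_STD_SAMPLE * math.sqrt((N_ROWS - 1) / N_ROWS)

def unstandardize(z, mean, std):
    """Invert the standardization z = (x - mean) / std."""
    return mean + z * std

# Standardized depth of cluster 3, copied from the "Cluster Centers" output
z_depth_cluster3 = 4.626822

depth_km = unstandardize(z_depth_cluster3, DEPTH_MEAN, depth_std_pop)
print(f"Cluster 3 center depth: {depth_km:.1f} km")
```

The result agrees with the cluster 3 mean depth of about 405.8 km in the "Cluster Means" table. On the fitted objects, `scaler.inverse_transform(kmeans.cluster_centers_)` performs the same conversion for all features at once.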

Mapping Major Earthquakes in Japan with Folium¶

In [99]:
import folium

# Folium map
m = folium.Map(location=[38, 142], zoom_start=4, tiles='Esri.WorldImagery')

cluster_colors = ['#1f78b4', '#33a02c', '#e31a1c', '#ff7f00']

for _, row in clustered_df.iterrows():
    tooltip = f"""
    <b>Depth:</b> {row['depth']} km<br/>
    <b>Cluster:</b> {int(row['cluster_label'])}
    """
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=3,
        color='white',
        weight=0.4,
        fill=True,
        fill_color=cluster_colors[int(row['cluster_label'])],
        fill_opacity=0.4,
        tooltip=tooltip
    ).add_to(m)

# Legend
legend_html = '''
<div style="position: fixed; bottom: 30px; left: 30px; width: 180px; height: 140px;
    background-color: white; z-index:9999; font-size:14px;
    border:2px solid grey; padding:10px;">
    <b>Cluster Label</b><br>
    <i style="background:#1f78b4; width:10px; height:10px; display:inline-block;"></i> Cluster 0<br>
    <i style="background:#33a02c; width:10px; height:10px; display:inline-block;"></i> Cluster 1<br>
    <i style="background:#e31a1c; width:10px; height:10px; display:inline-block;"></i> Cluster 2<br>
    <i style="background:#ff7f00; width:10px; height:10px; display:inline-block;"></i> Cluster 3
    
    
</div>
'''
m.get_root().html.add_child(folium.Element(legend_html))
m
Out[99]:
[Interactive Folium map: earthquake locations colored by cluster, with legend]
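Because the legend HTML is written by hand, its swatches can drift out of step with the `cluster_colors` list used for the markers. One way to rule that out is to generate the legend rows from the same list. The sketch below is one possible approach. `build_legend_html` is a hypothetical helper, and the styling is copied from the hand-written legend (without the fixed height, so the box can grow with the number of clusters).

```python
cluster_colors = ['#1f78b4', '#33a02c', '#e31a1c', '#ff7f00']

def build_legend_html(colors, title="Cluster Label"):
    """Return legend HTML with one swatch per cluster, in list order."""
    rows = "<br>\n".join(
        f'    <i style="background:{color}; width:10px; height:10px; '
        f'display:inline-block;"></i> Cluster {idx}'
        for idx, color in enumerate(colors)
    )
    return (
        '<div style="position: fixed; bottom: 30px; left: 30px; width: 180px; '
        'background-color: white; z-index:9999; font-size:14px; '
        'border:2px solid grey; padding:10px;">\n'
        f'    <b>{title}</b><br>\n{rows}\n</div>'
    )

legend_html = build_legend_html(cluster_colors)
print(legend_html)
```

The resulting string can be passed to `folium.Element` in the same way as the hand-written legend.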